{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "uvm.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "Installation\n",
        "Requirements:\n",
        "*   python >= 3.7\n",
        "\n",
        "We highly recommend CUDA when using torchRec. If using CUDA:\n",
        "*   cuda >= 11.0"
      ],
      "metadata": {
        "id": "KvT8qgk7Quw5"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TqfI1tUuQpR_"
      },
      "outputs": [],
      "source": [
        "# install conda to make installying pytorch with cudatoolkit 11.3 easier. \n",
        "!wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh\n",
        "!chmod +x Miniconda3-py37_4.9.2-Linux-x86_64.sh\n",
        "!bash ./Miniconda3-py37_4.9.2-Linux-x86_64.sh -b -f -p /usr/local\n",
        "\n",
        "# install pytorch with cudatoolkit 11.3\n",
        "!conda install pytorch cudatoolkit=11.3 -c pytorch-nightly -y\n",
        "\n",
        "# install torchrec\n",
        "!pip3 install torchrec-nightly"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The following steps are needed for the Colab runtime to detect the added shared libraries. The runtime searches for shared libraries in /usr/lib, so we copy over the libraries which were installed in /usr/local/lib/. **This is a very necessary step, only in the colab runtime.**"
      ],
      "metadata": {
        "id": "rPSNpjELC-gB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!cp /usr/local/lib/lib* /usr/lib/"
      ],
      "metadata": {
        "id": "HuF7H9ebRh2n"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "*Restart your runtime at this point for the newly installed packages to be seen.* Run the step below immediately after restarting so that python knows where to look for packages. Always run this step after restarting the runtime."
      ],
      "metadata": {
        "id": "RlhsxzDqC2pJ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import sys\n",
        "sys.path = ['', '/env/python', '/usr/local/lib/python37.zip', '/usr/local/lib/python3.7', '/usr/local/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/site-packages']"
      ],
      "metadata": {
        "id": "UGrMQgG_TFVh"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Distributed Setup\n",
        "\n",
        "We setup our environment with torch.distributed. For more info on distributed, see this [tutorial](https://pytorch.org/tutorials/beginner/dist_overview.html).\n",
        "\n",
        "Here, we use one rank (the colab process) corresponding to our 1 colab GPU."
      ],
      "metadata": {
        "id": "zc5Q0UieDNtf"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "import torch\n",
        "import torchrec\n",
        "import torch.distributed as dist\n",
        "\n",
        "os.environ[\"RANK\"] = \"0\"\n",
        "os.environ[\"WORLD_SIZE\"] = \"1\"\n",
        "os.environ[\"MASTER_ADDR\"] = \"localhost\"\n",
        "os.environ[\"MASTER_PORT\"] = \"29500\"\n",
        "\n",
        "# Note - you will need a V100 or A100 to run tutorial!\n",
        "# If using an older GPU (such as colab free K80), \n",
        "# you will need to compile fbgemm with the appripriate CUDA architecture\n",
        "# or run with \"gloo\" on CPUs \n",
        "dist.init_process_group(backend=\"nccl\")"
      ],
      "metadata": {
        "id": "0OkrjOr-TLVY"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Unified Virtual Memory (UVM)\n",
        "[UVM](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) supports many interesting features. In this tutorial, we are interested in over-subscribing capability. We first construct embedding table, which will be 2x larger than what GPU can support. (e.g. for 40GB A100, we allocate a 80GB embedding table)"
      ],
      "metadata": {
        "id": "IB4fgMu2OWgb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "gpu_device=torch.device(\"cuda\")\n",
        "hbm_cap_2x = 2 * torch.cuda.get_device_properties(gpu_device).total_memory\n",
        "\n",
        "embedding_dim = 8\n",
        "# By default, each element is FP32, hence, we divide by sizeof(FP32) == 4.\n",
        "num_embeddings = hbm_cap_2x // 4 // embedding_dim\n",
        "ebc = torchrec.EmbeddingBagCollection(\n",
        "    device=\"meta\",\n",
        "    tables=[\n",
        "        torchrec.EmbeddingBagConfig(\n",
        "            name=\"large_table\",\n",
        "            embedding_dim=embedding_dim,\n",
        "            num_embeddings=num_embeddings,\n",
        "            feature_names=[\"my_feature\"],\n",
        "            pooling=torchrec.PoolingType.SUM,\n",
        "        ),\n",
        "    ],\n",
        ")"
      ],
      "metadata": {
        "id": "3dIwv2yOWJjL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "One can enforce to use UVM to avoid out-of-memory issue. To do that, one can add a constraint to force the planner to use only the BATCHED_FUSED_UVM compute kernel. DistributedModelParallel will shard according to this constraint. BATCHED_FUSED_UVM kernel puts the embedding table on UVM. UVM allocates the table in the host memory, but not on the GPU memory. When GPU tries to access the embedding table, GPU fetches the table at page granularity on-demand to serve the access. One can expect that performance will be slower than having the table entirely on the GPU memory."
      ],
      "metadata": {
        "id": "ytVUOAXfO5gu"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from typing import cast\n",
        "\n",
        "from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology\n",
        "from torchrec.distributed.planner.types import ParameterConstraints\n",
        "from torchrec.distributed.embedding_types import EmbeddingComputeKernel\n",
        "from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder\n",
        "from torchrec.distributed.types import ModuleSharder\n",
        "\n",
        "\n",
        "topology = Topology(world_size=1, compute_device=\"cuda\")\n",
        "constraints = {\n",
        "    \"large_table\": ParameterConstraints(\n",
        "        sharding_types=[\"table_wise\"],\n",
        "        compute_kernels=[EmbeddingComputeKernel.BATCHED_FUSED_UVM.value],\n",
        "    )\n",
        "}\n",
        "plan = EmbeddingShardingPlanner(\n",
        "    topology=topology, constraints=constraints\n",
        ").plan(\n",
        "    ebc, [cast(ModuleSharder[torch.nn.Module], EmbeddingBagCollectionSharder())]\n",
        ")\n",
        "\n",
        "uvm_model = torchrec.distributed.DistributedModelParallel(\n",
        "    ebc,\n",
        "    device=torch.device(\"cuda\"),\n",
        "    plan=plan\n",
        ")\n",
        "\n",
        "# Notice \"batched_fused_uvm\" in compute_kernel.\n",
        "print(uvm_model.plan)\n",
        "\n",
        "# Notice ShardedEmbeddingBagCollection.\n",
        "print(uvm_model)"
      ],
      "metadata": {
        "id": "iq6XnYwlWM_P"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's test whether we can run the model even if table size is larger than GPU memory."
      ],
      "metadata": {
        "id": "-oh-gwVtg_Jw"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "mb = torchrec.KeyedJaggedTensor(\n",
        "    keys = [\"my_feature\"],\n",
        "    values = torch.tensor([101, 202, 303, 404, 505, 606]).cuda(),\n",
        "    lengths = torch.tensor([2, 0, 1, 1, 1, 1], dtype=torch.int64).cuda(),\n",
        ")"
      ],
      "metadata": {
        "id": "7hY7esYahANL"
      },
      "execution_count": null,
      "outputs": []
    },
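    {
      "cell_type": "markdown",
      "source": [
        "To make the input format above concrete, here is a minimal plain-Python sketch (not TorchRec code) of how `lengths` is assumed to partition `values` into one bag of ids per sample for the single key `my_feature`. The names `offsets` and `bags` are illustrative only."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from itertools import accumulate\n",
        "\n",
        "# Same ids and lengths as the KeyedJaggedTensor above, as plain lists.\n",
        "values = [101, 202, 303, 404, 505, 606]\n",
        "lengths = [2, 0, 1, 1, 1, 1]\n",
        "\n",
        "# The running sum of lengths gives the start/end offset of each bag.\n",
        "offsets = [0, *accumulate(lengths)]\n",
        "bags = [values[start:end] for start, end in zip(offsets, offsets[1:])]\n",
        "\n",
        "# One bag per sample; the second sample has no ids for this feature.\n",
        "# With SUM pooling, each bag is expected to reduce to a single embedding row.\n",
        "for sample, bag in enumerate(bags):\n",
        "    print(sample, bag)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },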
    {
      "cell_type": "markdown",
      "source": [
        "Because ShardedEmbeddingBagCollection returns EmbeddingCollectionAwaitable, wait() should be called to obtain tensor."
      ],
      "metadata": {
        "id": "jwddYnSckvtI"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "uvm_result = uvm_model(mb).wait()\n",
        "print(uvm_result)"
      ],
      "metadata": {
        "id": "e9R9EQRnWWfm"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# UVM Caching\n",
        "Default behavior when distributing in model parallel fashion is to use UVM caching when embedding table size exceeds GPU memory. UVM caching adds a software managed cache on GPU, which stores at a table row granularity. If the same table row is accessed frequently back-to-back, a cache hit will occur, hence achieving GPU memory performance. Otherwise, a cache miss will occur, and the table row needs to be fetched from host memory showing UVM performance."
      ],
      "metadata": {
        "id": "62w331nNfRWm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "default_model = torchrec.distributed.DistributedModelParallel(\n",
        "    ebc,\n",
        "    device=torch.device(\"cuda\"),\n",
        ")\n",
        "\n",
        "# Notice \"batched_fused_uvm_caching\" in compute_kernel.\n",
        "print(default_model.plan)"
      ],
      "metadata": {
        "id": "-MyaVWPkN-EV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Same as UVM case. EmbeddingCollectionAwaitable is returned, hence, wait() should be called to obtain tensor.\n",
        "default_result = default_model(mb).wait()\n",
        "print(default_result)"
      ],
      "metadata": {
        "id": "tND4-wPuhNsw"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "One can also control UVM caching's caching ratio. By default, it is 0.2, which means that software managed cache size is 20% of embedding table size. If one wants to reduce it further, it can be provided as constraint."
      ],
      "metadata": {
        "id": "IKJx_0ncmKJG"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "uvm_caching_constraints = {\n",
        "    \"large_table\": ParameterConstraints(\n",
        "        sharding_types=[\"table_wise\"],\n",
        "        compute_kernels=[EmbeddingComputeKernel.BATCHED_FUSED_UVM_CACHING.value],\n",
        "        caching_ratio=0.05,\n",
        "    )\n",
        "}\n",
        "uvm_caching_plan = EmbeddingShardingPlanner(\n",
        "    topology=topology, constraints=uvm_caching_constraints\n",
        ").plan(\n",
        "    ebc, [cast(ModuleSharder[torch.nn.Module], EmbeddingBagCollectionSharder())]\n",
        ")\n",
        "\n",
        "uvm_caching_model = torchrec.distributed.DistributedModelParallel(\n",
        "    ebc,\n",
        "    device=torch.device(\"cuda\"),\n",
        "    plan=uvm_caching_plan\n",
        ")\n",
        "\n",
        "print(uvm_caching_model.plan)"
      ],
      "metadata": {
        "id": "IByZ5NmAmLMj"
      },
      "execution_count": null,
      "outputs": []
    },
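    {
      "cell_type": "markdown",
      "source": [
        "For a sense of scale, the effect of `caching_ratio` can be sketched with plain arithmetic. This is a rough estimate that assumes FP32 rows, a hypothetical 40 GiB GPU, and a cache that holds `caching_ratio * num_embeddings` rows. The helper `cache_size_bytes` is illustrative and not part of TorchRec; the actual kernel may round the row count or add bookkeeping overhead."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "GIB = 1024 ** 3\n",
        "\n",
        "\n",
        "def cache_size_bytes(num_rows, dim, caching_ratio, bytes_per_element=4):\n",
        "    # Assumption: the cache holds caching_ratio * num_rows full rows.\n",
        "    cached_rows = round(num_rows * caching_ratio)\n",
        "    return cached_rows * dim * bytes_per_element\n",
        "\n",
        "\n",
        "# Hypothetical 40 GiB GPU; the table is sized to 2x GPU memory, as earlier in this tutorial.\n",
        "example_hbm_bytes = 40 * GIB\n",
        "example_dim = 8\n",
        "example_rows = 2 * example_hbm_bytes // 4 // example_dim\n",
        "\n",
        "for ratio in (0.2, 0.05):\n",
        "    size = cache_size_bytes(example_rows, example_dim, ratio)\n",
        "    print(f\"caching_ratio={ratio}: about {size / GIB:.1f} GiB of GPU cache\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },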
    {
      "cell_type": "code",
      "source": [
        "uvm_caching_result = uvm_caching_model(mb).wait()\n",
        "print(uvm_caching_result)"
      ],
      "metadata": {
        "id": "k3GSBOoamhgN"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}
